Regression Analysis

Linear Models

The overall goal for our introduction here, is to develop models that may be parameterized by one or more terms. The most simple model:

\[ y = \beta_0 + \beta_1 x_1 + \epsilon \]

which includes:

  • An intercept ( \(\beta_0\) ),

  • A slope coefficient ( \(\beta_1\) ), and

  • And an error term ( \(\epsilon\) ).

Linear Models

Building Models

Building A Model - Search Criterion

Searching Random Model Space

Searching Random Model Space

       beta0      beta1      dist
1  0.8845133  1.6045973 22.814578
2 28.5444125 -4.9872329 37.977087
3  9.3374831  2.7945715  8.767417
4 36.8241342 -4.6211640 29.551357
5 -2.1537706  0.6587616 31.216438
6 36.3533108  0.3825870 10.693640

Top 10 Random Models

Take the 10 models with the smallest sum of squared distances.

The Best Coefficients

Our Friend Least Squares Regression

 

Specifying a Formula

For R to perform these analyses for us, we need to be able to specify the function that we will use to represent the data (just like we did for plot(y~x)). Assuming the predictor is named x and the response is y, we have:

Single Predictor Model

y ~ x

Multiple Additive Predictors

y ~ x1 + x2 

Interaction Terms

y ~ x1 + x2 + x1*x2

Fitting A Model

The easiest model.


Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
     17.280        2.625  

Model Summaries

Summaries of the overal model components.


Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9836 -4.0182 -0.8709  5.3064  6.9909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   17.280      4.002   4.318  0.00255 **
x              2.626      0.645   4.070  0.00358 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.859 on 8 degrees of freedom
Multiple R-squared:  0.6744,    Adjusted R-squared:  0.6337 
F-statistic: 16.57 on 1 and 8 DF,  p-value: 0.003581

Components within the Summary Object

Like cor.test(), the object returned from lm() and its summary object have several internal components that you may use.

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

 

 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 

Components within the Summary Object

The probability can be found by looking at the data in the F-Statistic and then asking the F-distribution for the probability associated with the value of the test statistic and the degrees of freedom for both the model and the residuals.

   value    numdf    dendf 
16.56838  1.00000  8.00000 

The Summary Object

When you print out the summary( fit ) object, it uses the fstatistic to estimate a P-value. If we need that P-value directly, this is how it is done.

[1] 0.0035813

So You Have a Model, Now What?

Model Diagnostics - Residuals

Normality Of the Data

Leverage

Decomposition of Variance

The terms in this table are:

  • Degrees of Freedom (df): representing 1 degree of freedom for the model, and N-1 for the residuals.

Decomposition of Variance

The terms in this table are:

  • Sums of Squared Deviations:
    • \(SS_{Total} = \sum_{i=1}^N (y_i - \bar{y})^2\)
    • \(SS_{Model} = \sum_{i=1}^N (\hat{y}_i - \bar{y})^2\), and
    • \(SS_{Residual} = SS_{Total} - SS_{Model}\)

Decomposition of Variance

The terms in this table are:

  • Mean Squares (Standardization of the Sums of Squares for the degrees of freedom)
    • \(MS_{Model} = \frac{SS_{Model}}{df_{Model}}\)
    • \(MS_{Residual} = \frac{SS_{Residual}}{df_{Residual}}\)

Decomposition of Variance

The terms in this table are:

  • The \(F\)-statistic is from a known distribution and is defined by the ratio of Mean Squared values.

  • Pr(>F) is the probability associated the value of the \(F\)-statistic and is dependent upon the degrees of freedom for the model and residuals.

Decomposition of Variance

The classis ANOVA Table

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)   
x          1 568.67  568.67  16.568 0.003581 **
Residuals  8 274.58   34.32                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variance Explained


Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9836 -4.0182 -0.8709  5.3064  6.9909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   17.280      4.002   4.318  0.00255 **
x              2.626      0.645   4.070  0.00358 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.859 on 8 degrees of freedom
Multiple R-squared:  0.6744,    Adjusted R-squared:  0.6337 
F-statistic: 16.57 on 1 and 8 DF,  p-value: 0.003581

Relationship Between \(R^2\) & \(r\)

How much of the variation is explained?

\[ R^2 = \frac{SS_{Model}}{SS_{Total}} \]

     Regression R^2 Squared Correlation 
          0.6743782           0.6743782 

The square of the Pearson Correlation is equal to R

Helper Functions

Grabbing the predicted values \(\hat{y}\) from the model.

Helper Functions - Residuals

The residuals are the distances between the observed value and its corresponding value on the fitted line.

Comparing Models

What Makes One Model Better

There are two parameters that we have already looked at that may help. These are:

  • The \(P-value\): Models with smaller probabilities could be considered more informative.

  • The \(R^2\): Models that explain more of the variation may be considered more informative.

New Data Set

Let’s start by looking at some air quality data, that is built into R as an example data set.

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       

Base Models - What Influences Ozone

Individually, we can estimate a set of first-order models.

\[ y = \beta_0 + \beta_1 x_1 + \epsilon \]

Base Models - What Influences Ozone

Model parameters predicting mean ozone in parts per billion mearsured in New York during the period of 1 May 1973 - 30 September 1973 as predicted by Temperature, Windspeed, and Solar Radiation.
Model R2 P
Ozone ~ Solar 0.121 1.79e-04
Ozone ~ Temp 0.488 0.00e+00
Ozone ~ Wind 0.362 9.27e-13

More Complicated Models

Multiple Regression Model - Including more than one predictors.

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)

More Complicated Models

Model parameters predicting mean ozone in parts per billion mresured in New York during the period of 1 May 1973 - 30 September 1973.
Model R2 P
Ozone ~ Solar 0.121 1.79e-04
Ozone ~ Temp 0.488 0.00e+00
Ozone ~ Wind 0.362 9.27e-13
Ozone ~ Temp + Wind 0.569 0.00e+00
Ozone ~ Temp + Solar 0.510 0.00e+00
Ozone ~ Wind + Solar 0.449 9.99e-15

For Completeness

How about all the predictors. \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)

For Completeness

Model R2 P
Ozone ~ Solar 1.21e-01 1.79e-04
Ozone ~ Temp 4.88e-01 0.00e+00
Ozone ~ Wind 3.62e-01 9.27e-13
Ozone ~ Temp + Wind 5.69e-01 0.00e+00
Ozone ~ Temp + Solar 5.10e-01 0.00e+00
Ozone ~ Wind + Solar 4.49e-01 9.99e-15
Ozone ~ Temp + Wind + Solar 6.06e-01 0.00e+00

\(R^2\) Inflation

Any variable added to a model will be able to generate Sums of Squares (even if it is a small amount). So, adding variables may artifically inflate the Model Sums of Squares.

Example:

What happens if I add just random data to the regression models? How does \(R^2\) change?

Random Data Effects

Original data in models.
Models R2
Ozone ~ Temp 0.4877
Ozone ~ Wind 0.3619
Ozone ~ Solar 0.1213
Ozone ~ Temp + Wind 0.5687
Ozone ~ Temp + Solar 0.5103
Ozone ~ Wind + Solar 0.4495
Ozone ~ Temp + Wind + Solar 0.6059

Random Data Effects

Original full model + X random variables
Models R2
Ozone ~ Temp + Wind + Solar + 1 Random Variables 0.6085
Ozone ~ Temp + Wind + Solar + 2 Random Variables 0.6103
Ozone ~ Temp + Wind + Solar + 3 Random Variables 0.6110
Ozone ~ Temp + Wind + Solar + 4 Random Variables 0.6139
Ozone ~ Temp + Wind + Solar + 5 Random Variables 0.6144
Ozone ~ Temp + Wind + Solar + 6 Random Variables 0.6237
Ozone ~ Temp + Wind + Solar + 7 Random Variables 0.6246
Ozone ~ Temp + Wind + Solar + 8 Random Variables 0.6250

Perfect - My Models RULE

I can just add random variables to my model and always get an awesome fit!

 

 

Not so fast Bevis!

Model Comparisons

Akaike Information Criterion (AIC) is a measurement that allows us to compare models while penalizing for adding new parameters.

\[AIC = -2 \ln L + 2p\]

The criterion here are to find models with the lowest (absolute) AIC values.

Model Comparisons

To compare, we evaluate the differences in AIC for alternative models.

\(\delta AIC = AIC - min( AIC )\)

Interpretation

  • \(0 < \delta AIC\; < 2.0\): Models for consideration.
  • \(2.0 < \delta AIC\; < 5.0\): May be of interest.
  • \(\delta AIC > 5.0\): Not to be considered.

AIC & ∂AIC

Models R2 AIC deltaAIC
Ozone ~ Temp 0.488 1067.706 68.989
Ozone ~ Wind 0.362 1093.187 94.470
Ozone ~ Solar 0.121 1083.714 84.997
Ozone ~ Temp + Wind 0.569 1049.741 51.024
Ozone ~ Temp + Solar 0.510 1020.820 22.103
Ozone ~ Wind + Solar 0.449 1033.816 35.098
Ozone ~ Temp + Wind + Solar 0.606 998.717 0.000
Ozone ~ Temp + Wind + Solar + 1 Random Variables 0.608 999.988 1.271
Ozone ~ Temp + Wind + Solar + 2 Random Variables 0.610 1001.467 2.750
Ozone ~ Temp + Wind + Solar + 3 Random Variables 0.611 1003.258 4.541
Ozone ~ Temp + Wind + Solar + 4 Random Variables 0.614 1004.425 5.708
Ozone ~ Temp + Wind + Solar + 5 Random Variables 0.614 1006.290 7.573
Ozone ~ Temp + Wind + Solar + 6 Random Variables 0.624 1005.593 6.876
Ozone ~ Temp + Wind + Solar + 7 Random Variables 0.625 1007.326 8.609
Ozone ~ Temp + Wind + Solar + 8 Random Variables 0.625 1009.188 10.471

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.

Peter Sellers looking bored